Fault-Tolerant Iterative Methods via Selective Reliability

نویسندگان

  • Mark Hoemmen
  • Michael A. Heroux
چکیده

Current iterative methods for solving linear equations assume reliability of data (no “bit flips”) and arithmetic (correct up to rounding error). If faults occur, the solver usually either aborts, or computes the wrong answer without indication. System reliability guarantees consume energy or reduces performance. As processor counts continue to grow, these costs will become unbearable. Instead, we show that if the system lets applications apply reliability selectively, we can develop iterations that compute the right answer despite faults. These “fault-tolerant” methods either converge eventually, at a rate that degrades gracefully with increased fault rate, or return a clear failure indication in the rare case that they cannot converge. If faults are infrequent, these algorithms spend most of their time in unreliable mode. This can save energy, improve performance, and avoid restarting from checkpoints. We illustrate convergence for a sample algorithm, Fault-Tolerant GMRES, for representative test problems and fault rates.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Analysis of Selective Fault - Tolerant , Hard Real - Time

An increasing number of applications are demanding real-time performance from their multiprocessor systems. For many of these applications, a failure may produce disastrous results. Such failures are avoided in hard real-time systems by the use of fault-tolerance. In hard real-time multiprocessor scheduling, this fault tolerance may be provided by including several task backups in each schedule...

متن کامل

A Microprocessor-Based Hybrid Duplex Fault-Tolerant System

Reliability is one of the fundamental considerations in the design of industrial control equipment. The microprocessor-based Hybrid Duplex fault-tolerant System (HDS) proposed in this paper has high reliability to meet this demand although its hardware structure is simple. The hardware configuration of HDS and the fault tolerance of this system are described. The switching control strategies in...

متن کامل

Fault-tolerant evolvable hardware using field-programmable transistor arrays

The paper presents an evolutionary approach to the design of fault-tolerant VLSI (very large scale integrated) circuits using EHW (evolvable hardware). The EHW research area comprises a set of applications where GA (genetic algorithm) are used for the automatic synthesis and adaptation of electronic circuits. EHW is particularly suitable for applications requiring changes in task requirements a...

متن کامل

A Contribution to the Evaluation of the Reliability of Iterative-Execution Software

This paper deals with the reliability of software executed iteratively, as e.g. in process control applications. The probability of mission survival is evaluated taking account of two characteristics of iterative software: a) system failure is defined in terms of the behaviour of the software over successive iterations, because the controlled system can usually tolerate short bursts of errors; ...

متن کامل

Reliability Growth of Fault - Tolerant Software

Two fault-tolerant software techniques are investigated: recovery block and N-version programming. For each, the stable reliability model is transformed into a model that considers reliability growth via the transformation approach based on the hyperexponential model. Analytic and numeric processing of the transformed models identify the influence of fault removal on the reliability of the faul...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011